Classroom observation has the potential of obtaining
valuable information regarding teacher and child behavior. This study
examines the accuracy of observers coding standard stimuli. In this
procedure the observer's bias is examined, as well as the confidence
that can be placed in the observation code itself. Through these
procedures the exact nature of the confusion of codes can be
identified. In order to avoid the problems encountered with the
paired observer method, an attempt was made to assess the accuracy of
observers through the use of controlled videotape examples which
allow each interaction (a frame) and sequences of frames to be
analyzed for accuracy. Ten videotaped skits were produced to present
concise, clear examples of each code used in recording classroom
interactions on an observation instrument. Confusability (low
observer agreement) matrices were constructed by tallying the
observer code sequences. Results of the confusability study identify
the specific codes that appear to be reliable as well as those that
are confused and need to be redefined. Inter-rater accuracy and
videotape simulation accuracy are compared. While the two systems of
examining observer accuracy do yield some different information, it
is not contradictory, and the videotape system is easier to
interpret. (Author/RC)
A STUDY OF CONFUSABILITY OF CODES
IN OBSERVATIONAL MEASUREMENT



As a measurement technique, classroom observation has the potential 
of obtaining valuable information regarding teacher and child behavior. 
This potential is realized to the extent that the observation data can be
shown to be reliable. Reliability, as Cronbach et al. (1972) have pointed
out, is related to the notion of generalizability from a sample to some
universe of interest.

Several sources of unreliability have been identified in past 
research. Medley & Mitzel (1963) say that: 

Most commonly, it [unreliability] occurs when two 
measures of the same class tend to differ too much; 
this may happen because the behaviors are unstable, 
because the observers are unable to agree on what
occurs, because the different items which enter into
the measurement lack consistency, or for some other
reason.

Neither Cronbach nor Medley addresses the question of the confusability
of codes used in observation instruments. The present study examines the
accuracy of observers coding standard stimuli. In this procedure the
observer's bias is examined, as well as the confidence that can be placed
in the observation code itself. Through these procedures the exact
nature of the confusion of codes can be identified.

In previous SRI reliability studies, the technique of pairing the
observers with an SRI trainer has been used. However, there are some
problems in assessing inter-rater reliability. First, there is some
variability in the coding skills of SRI trainers. Second, there is most
certainly a variability in the incidents which occur in the classrooms,
in what is selected for observation, and in which codes are used in the
observations. The optimum arrangement might be to have all observers and
SRI trainers observe the same phenomena in the same classroom at the same
time. But, as Soar (1973) says:

The critical problem (of paired observers) is the
effect on the classroom of increasing the number of
observers. One observer represents a threat to many
teachers and a distraction to the children, at least
initially, and as the number of observers increases,
these difficulties increase, probably more like a
geometric function than an arithmetic one.

In an effort to avoid the problems encountered with the paired
observer method, SRI staff have attempted to assess the accuracy of
observers through the use of controlled videotape examples. This
procedure allows each interaction (or frame) and sequences of frames to be
analyzed for accuracy, whereas previously only simple marginal frequency
counts of single codes could be computed.

Other investigators in observational research also use videotapes to
assess observer accuracy. Soar (1973) used tapes of actual classroom
events, and Simmel (1973) cleverly used the last ten minutes of the 
Johnny Carson Show to check observer accuracy on a weekly basis. 
Although they are useful, the limitations of videotapes also should be 
recognized: 

• Because of the difficulty in seeing and hearing, videotapes
are more difficult to code than live conversations;

• It is more difficult to understand the gestalt of the
situation from a tape than it is from a live situation
in the classroom;

• Simulated skits are likely to be more clear-cut examples
than those which actually occur in classrooms.



A. PROCEDURE TO ASSESS THE CONFUSABILITY OF OBSERVATION CODES*
1. A Description of Procedures

Differing from both Soar and Simmel, SRI staff produced ten
videotaped skits. Each simulation is approximately 20 interaction frames
long. These skits attempt to present concise, clear examples of each
code used in recording classroom interactions on the SRI observation
instrument. Each skit begins with a still picture and the voice of a
narrator who explains the situation and identifies the focus person. The
skit is then shown at regular speed. After the skit is shown once, the
still picture and narrator again identify the focus person. Each skit
is then shown again, this time with a 2- to 3-second pause between each
interaction. The observers are instructed to code this stop-action
portion of the skit and to code one frame during each stop or pause.

*These procedures were developed at SRI by J. Philip Baker, Phillip
Giesen, and Charles Norwood.
2. Procedural Problems 

The reliability coding booklets were returned to SRI and
compared with the criterion sequences. This revealed that some observers
were coding more than one frame during a pause. Conversely, some
observers, possibly while turning pages, omitted frames. The trainers
reviewed the coding sequences and deleted extraneous frames or inserted
spaces so as to align the observers' sequences with the criterion
sequences. Three trainers performed this operation. Since judgment is
involved, a check was made on the code sequences of 10 observers to see
whether the trainers arranged the sequences in the same manner. The
average agreement between trainers in arranging these sequences was
96.4 percent.

Other procedural problems were also encountered due to the
experimental nature of the techniques used. Comments received from the
observers indicated that not all of the equipment utilized to administer
the tapes was in good condition, and, as a result, the sound or pictures
were of poor quality. Also, some examples on the criterion tape were
technically less than well executed. The most serious problem, however,
was that there were too few examples of several of the codes on the
criterion tape. Five or fewer examples of a code limited the assurance
that representative examples of the code were shown. Further, if an
observer missed two out of four possible instances of a code, he only had
a score of 50 percent of the criterion correct; however, if he missed two
out of 30 possibilities, he had a score of 93 percent of the criterion
correct. For this reason, the codes which have fewer than six examples
will not be interpreted in this analysis. The number of examples of each
code on the complete set of tapes ranges from zero to 40. (This problem
is being remedied by the development of more skits.)
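
The sensitivity of a percent-correct score to the number of criterion examples can be checked with a few lines of arithmetic. The following is only an illustrative sketch in Python; the function name `percent_correct` is an assumption introduced here and is not part of the SRI procedure.

```python
def percent_correct(n_examples: int, n_missed: int) -> float:
    """Percent of criterion examples of one code that were coded correctly."""
    if n_examples <= 0:
        raise ValueError("n_examples must be positive")
    return 100.0 * (n_examples - n_missed) / n_examples


# The two cases described in the text: two misses out of four examples,
# and two misses out of 30 examples.
few_examples = percent_correct(4, 2)
many_examples = percent_correct(30, 2)
print(few_examples, round(many_examples))
```

With the same two misses, the score moves from 50 percent to about 93 percent, which is why codes with very few criterion examples are set aside.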
B. A DESCRIPTION OF CONFUSABILITY MATRICES

Confusability of codes refers to codes which were confused with the
correct codes by an observer. Confusability matrices were constructed by
tallying the observer code sequences. For each frame, a tally mark was
entered in the box or cell created by the juncture of the criterion code
and the code marked by the observer. Figure 1 shows an example of a
confusability matrix for the "What" codes.* The principal diagonal
contains the cells indicating correct coding; other cells contain incorrect
coding. The column totals are the total number of criterion examples
shown on the videotape for each code; the row totals are the total number
of times an observer recorded each code, whether correctly or not. An
examination of a particular cell reveals whether the code was recorded
correctly or incorrectly and, if recorded incorrectly, shows exactly which
codes were confused.

The total number of tallies in each cell can be used to calculate
the rates of accuracy in two related but distinct ways. The first
procedure allows an examination of observer bias. If the
number in a given cell is compared to the total number of recordings
(row total) of the code that pertains to that cell (see the row indicator),
a ratio of correct or incorrect responses can be derived. For the cells
that fall on the main diagonal, the numbers indicate the proportion of
times the code recorded by an observer was correct. For the cells that
do not fall on the diagonal, the number indicates the proportion of error.
This accuracy rate is called the Accuracy Rate of each observer on
each code; i.e., the ratio of correct or incorrect codes to the total
number of coded observations.

The point of the second procedure is to assess the confidence that
can be placed in each code. The second accuracy figure can be arrived
at by comparing the number of tallies located in the same cell to the
total number of examples of that specific code presented on the criterion
tape (see column total). Again, the number arrived at shows the
proportion of correctness or incorrectness, based on whether the cell falls
on the diagonal. This proportion of times the criterion instances were
recorded correctly or incorrectly is called the Criterion Accuracy Rate.
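
The two rates can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the SRI tabulation itself: the function names (`tally`, `accuracy_rate`, `criterion_accuracy_rate`) and the six-frame sequences are hypothetical. Cells are keyed by (recorded code, criterion code), so rows correspond to what the observer recorded and columns to the criterion.

```python
from collections import Counter


def tally(recorded, criterion):
    """Count (recorded code, criterion code) pairs, one pair per frame."""
    if len(recorded) != len(criterion):
        raise ValueError("sequences must be aligned frame by frame")
    return Counter(zip(recorded, criterion))


def accuracy_rate(cells, row_code, col_code):
    """Cell tally divided by the row total (all recordings of row_code)."""
    row_total = sum(n for (rec, _), n in cells.items() if rec == row_code)
    return cells[(row_code, col_code)] / row_total if row_total else None


def criterion_accuracy_rate(cells, row_code, col_code):
    """Cell tally divided by the column total (all criterion examples of col_code)."""
    col_total = sum(n for (_, crit), n in cells.items() if crit == col_code)
    return cells[(row_code, col_code)] / col_total if col_total else None


# Hypothetical aligned sequences (not data from the study).
criterion = ["1Q", "1Q", "3", "3", "7", "7"]
recorded = ["1Q", "1Q", "3", "7", "7", "7"]
cells = tally(recorded, criterion)

# The observer wrote "7" three times, but only two of those were truly "7".
print(accuracy_rate(cells, "7", "7"))
# Both criterion examples of "7" were nevertheless recorded as "7".
print(criterion_accuracy_rate(cells, "7", "7"))
```

In this invented case the observer overuses "7": the row-based rate falls below 1.00 while the column-based rate for "7" stays at 1.00, and the column-based rate for "3" drops instead.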



*See Table 1 for a brief explanation of the SRI "What" and "How" codes.
Table 1

SRI "WHAT" AND "HOW" CODES

"What" Codes                        "How" Codes

1  - Command or Request             NV - Nonverbal
1Q - Direct Question                X  - Movement
2  - Open-ended Question            H  - Happy
3  - Response                       U  - Unhappy
4  - Instruction, Explanation       N  - Negative
5  - Comments, Greetings;           T  - Touch
     General Action                 Q  - Question
6  - Task-related Statement         G  - Guide/Reason
7  - Acknowledge                    P  - Punish
8  - Praise                         O  - Object
9  - Corrective Feedback            W  - Worth
10 - No Response                    DP - Dramatic Play/
11 - Waiting                             Pretending
12 - Observing, Listening           A  - Academic
                                    B  - Behavior
Figure 2 presents the proportion of cell tallies to the row totals
for each cell. This provides a matrix that presents the Accuracy Rates
in place of the raw scores shown in Figure 1. For example, the observer
recorded a "1" code correctly three times, so the Accuracy Rate is 100
percent, or 1.00. In the second row, "1Q" was recorded correctly 20 out of
21 times, or 95 percent of the total number of times.

Looking across the "1Q" row, we see that the observer called an
example of a "12" a "1Q" five percent of the time. For another example,
we look across the row of code "5" and see that our observer did not code
any "5's" correctly. Instead, she mistakenly coded two examples of
code "5" as a "5NV" and a "6."

In Figure 3, the proportions of criterion examples for each code
correctly recorded by the observer are found in the diagonal cells.
Entries in cells down the column marked by the correct code, other than
in the diagonal cells, are instances of confusion. As can be seen in the
"1Q" column of Figure 3, two other codes were sometimes confused
with the "1Q" code. The bottom row of the figure presents the correct
number of criterion examples of each code which appeared on the
videotape. The last column on the table lists the number of times the
observer recorded each code. In the code "12" column on Figure 3, the
observer recorded three more "12's" than appeared on the videotapes.
Apparently, these three were confused with some other code.

Figures 2 and 3 can be overlaid so that the top entry in a cell
refers to the accuracy of what the observer recorded, and the lower entry
refers to the percent of criterion examples which were correctly coded.
Figure 4 illustrates such an overlay. The combined figure tells us that
when this observer recorded a "7," it was indeed a "7" (there are no other
entries in the code "7" row). However, she recorded only eight "7's" and
there were actually 11 examples on the videotapes. Looking down the
column for code "7" and at the lower entry in the cell, we see that
examples of "7" were recorded as "3" nine percent of the time, as "8"
nine percent of the time, and as "12" nine percent of the time. We can
conclude from this example that when the observer recorded a "7" it was
truly a "7," but she underestimated the number of times they occurred.
She recorded some of the "7's" as "3," "8," or "12."
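
The figures in this example can be reproduced with simple arithmetic. The Python sketch below uses the counts stated in the text (eight "7's" recorded, all correct, against 11 criterion examples). The count of one misrecording each for "3," "8," and "12" is an inference from the nine-percent figures, and the variable names are assumptions made for illustration.

```python
# Column "7" of the matrix: how the 11 criterion examples of "7" were recorded.
# One misrecording per code is inferred from "nine percent of the time".
recorded_as = {"7": 8, "3": 1, "8": 1, "12": 1}

row_total_7 = 8                              # times the observer wrote "7"
column_total_7 = sum(recorded_as.values())   # criterion examples of "7"

observer_accuracy = recorded_as["7"] / row_total_7
criterion_accuracy = recorded_as["7"] / column_total_7
confusion_percent = {
    code: round(100 * n / column_total_7)
    for code, n in recorded_as.items()
    if code != "7"
}

print(observer_accuracy, round(criterion_accuracy, 2), confusion_percent)
```

The upper entry of the overlaid cell is therefore 1.00, while the lower entry is about .73, with nine percent of the criterion examples going to each of the three other codes.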
C. CONFUSION OF OBSERVATION CODES (A STUDY COMBINING THE RESULTS OF
63 OBSERVERS)

Analyses of these matrices were used in two ways. First, by combining
the results for all observers, the extent of general confusability of
codes could be examined. Codes that reveal a high rate of confusion by
several observers suggest these possible causes: an overlapping of code
definitions, poor videotape examples, or less-than-adequate training
procedures. Second, the accuracy of individual observers could be
examined with these matrices. For example, if an observer were not
very accurate on code "8" (praise), then codes using "praise" could be
examined for anomalies.* The findings reported here represent 63 observers
spread among 30 geographical locations.

How the observers coded the videotaped examples is shown in Figure 5.
The diagonal shows the number of correct codings. The row at the bottom
of the table is the number of videotape criterion examples coded by all
of the observers. Each figure in the bottom row can be compared with the
corresponding cell in the diagonal. For example, code "1" was recorded
correctly 160 times out of a possible 245 times. The other entries in
the "1" column are sources of confusion.**
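
Combining the results of several observers amounts to adding their tally matrices cell by cell. The Python sketch below illustrates this under stated assumptions: the function name `pool` and the two small observer matrices are hypothetical and are not data from the study. Cells are keyed by (recorded code, criterion code).

```python
from collections import Counter


def pool(matrices):
    """Add the tallies of several observers cell by cell."""
    total = Counter()
    for matrix in matrices:
        total.update(matrix)   # Counter.update adds counts
    return total


# Hypothetical tallies for two observers (not study data).
observer_a = Counter({("1", "1"): 3, ("1Q", "1"): 1})
observer_b = Counter({("1", "1"): 2, ("6", "1"): 2})
combined = pool([observer_a, observer_b])

# Diagonal cell versus bottom-row (column) total for code "1".
correct = combined[("1", "1")]
possible = sum(n for (_, crit), n in combined.items() if crit == "1")
print(correct, "of", possible)
```

The comparison of the diagonal cell with the column total is the same one made in the text for code "1" across all observers.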

The proportion of times that the observers were correct in their
recording is presented in Figure 6. For example, the observers recorded
"1Q" correctly 80 percent of the time. Looking across the "1Q" row, we
see that two percent of the time when a "1Q" was recorded it was truly
a "1," and 11 percent of the time it was truly a "2."

The proportion of videotaped examples recorded by the 63 observers
is shown in Figure 7. The number in the diagonal reports the percent
recorded correctly. The numbers in the columns outside of the diagonal
indicate the source of error. If all of the numbers in the columns were
in the diagonal cell, the result would be 100 percent correct. The total
number of possible examples on the videotape is listed on the bottom row.
For example, "1Q" was recorded correctly 77 percent of the time, whereas
16 percent of the time it was confused with the codes "1," "2," and "3."

Figure 8 is an overlay of Figures 6 and 7. This provides the data
necessary to quickly assess the observers' accuracy (by looking at the
topmost entry in a cell) and the percent of criterion codes which have
been recorded (by looking at the lower entry in a cell).

*Caution: Even if the observer were 100 percent in agreement with the
criterion examples, in a study of this type generalization would still
be limited by the day-to-day variability of classroom events.

**Sources of error less than three percent are not included on the table.
1. Findings of "What" Code Confusions

Since a "What" code is required for each interaction, each
recorded frame must include a recording of a "What" code. The observer
only has the option of recording the correct (or criterion) code or
recording the wrong code. The entire frame is considered void if no
"What" code is recorded.

In Figure 8, four of the "What" codes have been separated into
two categories: the "What" code alone and the "What" code with its "How"
modifier. This was done because the meaning or definition of the "What"
code is modified or sometimes changed by the addition of these specific
"How" codes. An example of this is the "5" code. The definition of the
"5" is "general comment," but the definition of the "5NV" is "general
action."

As mentioned earlier, the number of criterion examples for
some codes is small, which limits the conclusions that can be
drawn regarding these low-frequency codes. For this reason, codes with
fewer than six examples on the videotapes will not be discussed.

Nine of the 16 "What" codes have six or more criterion examples
of each code. These are the shaded diagonal cells in Figure 8. Of these
nine ("1Q," direct question; "3," response; "4," instruction; "4NV,"
self learning; "5NV," general action; "6," task-related comment; "7,"
acknowledge; "9," corrective feedback; and "12," attending), only "6"
has an observer accuracy rate that is lower than .70.

Code "6," task-related comment, was confused most often with
code "3," response. It was also sometimes confused with examples which
were actually "1," direct request, "2," open-ended questions, and "9,"
corrective feedback (see row "6"). This suggests that the definitions and
training procedures need to be more exact regarding when to code a
task-related comment "6." The numbers in the lower section of the cell
(looking down the "6" column) indicate that 23 percent of the time the
criterion examples of "6" were recorded as "5."

[Figure 8. Overlay of Figures 6 and 7 for the "What" codes. Rows: codes
as recorded by observers (upper numbers in cells, by row).]

The next lowest in reliability was the "4" code, instructing.
Eleven percent of the time the observers recorded what was actually "12,"
observing, as a "4." Since "4" is verbal and "12" is nonverbal, the
problem would not appear to be one of confusion in the normal sense but,
rather, confusion of which person to focus upon. This conclusion is based
on the fact that both of these codes generally occur simultaneously (that
is, when a teacher is instructing, "4," the children are usually attending,
"12"). Apparently the observers confused which person to record. As can
be seen in Figure 8 in the code "12" row, a true example of "4" was
sometimes confused and recorded as a "12," which is a further indication that
the instructions regarding the focus of observation were not clearly
understood by observers.

Code "4NV" describes a child working alone, instructing
himself. Observers recorded this reliably 88 percent of the time. They
sometimes confused "4NV" with what was truly a "5NV," a code that describes
"play" rather than "self instruction in a task." Looking down the "4NV"
column, it can be seen that 15 percent of the videotaped examples were
recorded as "5NV's." This confusion of "4NV" and "5NV" indicates an
overlap of definitions (or a conceptual difficulty in distinguishing
"work" from "play").

Criterion examples of code "7," acknowledgement, were sometimes
confused with "6" and "12." Code "7" is sometimes confused with code "3,"
responding (see row 7 in Figure 8). It is easy to see how acknowledging
a child can be confused with responding to a child. On the other hand,
code "3," responding, was one of the more reliable codes. It was not
confused with "7" (see Figure 8). In fact, the observers recorded it
correctly 91 percent of the time, and column 3 indicates that five percent
of the criterion examples were confused with "6."

Eleven percent of the recorded code "1Q's," asking direct questions,
were actually code "2's," asking open-ended questions. The
confusion between "1Q" and "2" has long been recognized by the SRI
researchers. Each year the variables have been defined more carefully;
however, there still seems to be a gray area between the two
codes. Code "2," which has too few examples to analyze with confidence,
was also confused with "1Q." The results of individual observers were
examined, and apparently those observers who observed models which do not
often require the "2" code had a higher rate of error.

Eighty-six percent of the time the observers recorded "9,"
corrective feedback, correctly; five percent of the time code "6" was
recorded as "9" (see row 9, the upper value). The criterion examples as
illustrated in column 9 (the lower value) were sometimes recorded as "1,"
"1Q," and "6."

2. Findings for "How" Code Confusions 

A "How" code is not always required. This rule leads to four
distinct possibilities: 

• A required "How" was left out of the frame (omission). 
These are listed at the bottom of Figure 9; 

• A "How" code was recorded when not called for (intrusion). 
These are listed in the last column of Figure 9; 

• The criterion "How" code was confused with another code. 
These are entered in other than the diagonal cells; 

• The criterion "How" code was recorded accurately. 
These are entered in the diagonal cells. 
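
The four possibilities can be expressed as a small classification rule. The Python sketch below is an illustration only: the function name `classify_how` and the use of `None` to stand for "no How code" are assumptions. The extra "not applicable" outcome, for frames where no "How" code was required and none was recorded, is also an assumption added so that every frame receives a label.

```python
from typing import Optional


def classify_how(criterion: Optional[str], recorded: Optional[str]) -> str:
    """Classify one frame's "How" coding; None means no "How" code."""
    if criterion is None and recorded is None:
        return "not applicable"   # assumed case: nothing required, nothing recorded
    if recorded is None:
        return "omission"         # a required "How" was left out
    if criterion is None:
        return "intrusion"        # a "How" was recorded when not called for
    return "correct" if recorded == criterion else "confusion"


print(classify_how("A", None))    # omission
print(classify_how(None, "A"))    # intrusion
print(classify_how("B", "A"))     # confusion
print(classify_how("NV", "NV"))   # correct
```

Omissions and intrusions fall outside the square part of the matrix, which is why they are listed in a separate bottom row and last column.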

Only six of the 14 "How" codes were represented by six or more
examples on the videotapes. These are "NV," "X," "A," "B," "DP," and "O"
(see Figure 9). As described on page 3, codes with fewer than six
examples will not be discussed. Also as described earlier, the upper
value in a cell reports the percent of observer accuracy. The lower
value in the cell reports the percent of the videotaped examples which
were correctly recorded.

The nonverbal code "NV" was recorded correctly 93 percent of
the time by observers; and, overall, the observers omitted only 13 percent
of the criterion examples. Code "X," movement, was also found to be
reasonably reliable. Eighty-nine percent of the time the observers
recorded it correctly, but 20 percent of the examples were omitted by
observers.
Observers recorded the "A" code, academic, correctly 81 percent
of the time. Four percent of the "A's" recorded were truly code "B," and
15 percent of the "A's" were actually intrusions. Seventy-six percent of
the videotaped examples were recorded correctly, and 21 percent were
omitted.

While 95 percent of the examples recorded as code "B" by the
observers were correct (see row B), 43 percent of the "B's" were omitted
and 13 percent of the examples of "B's" were incorrectly recorded as
"A's." This leads to the conclusion that if a "B" is recorded, it is
likely to be correct, but the total number of "B" codes may be
underestimated by over 50 percent. An examination of each observer's work is
important in order to discover the source of the underestimation. It is
possible that only a few observers are grossly underestimating "B's," or
it could be that many of the 63 observers are underestimating "B's" to
only a small degree.

The two remaining codes with six or more examples ("DP,"
dramatic play, and "O," use of objects) were recorded accurately over
30 percent of the time, but both codes were underestimated (43 percent
and 33 percent of the time).
3. Summary 

The results of the confusability study identify the specific codes
that appear to be reliable as well as those that are confused and need to be
redefined. The findings suggest that some codes, such as "6," "4NV," and
"5NV," should be more carefully defined because of overlapping definitions.
There is some indication that there should be more careful training of
observers on the focus of observation so that "4" and "12" will not be
confused. The overall reliability for all observers was 78 percent on the
"What" codes and 81 percent on the "How" codes.

D. ACCURACY OF INDIVIDUAL OBSERVERS

The value of this new method for measuring accuracy is that it
contributes directly toward interpreting the data. Observer bias can be
assessed by examining the overuse, underuse, or confusion of codes. In
this study, each observer was responsible for observing one grade level
at a single site. Therefore, the data collected by each observer are
identifiable in the analysis.

In order to determine the accuracy rates for each observer separately,
tables were constructed that graphically present, by sponsor, each
observer's results (see Table 2). Thus, for example, if an observer in
Grade 1 at Site X had difficulty with the code "7," acknowledgment, it is
possible to compute the site mean of code "7" and compare it with the
first grade means of code "7" at the four other sites of the sponsor. If
the means of the four sites (not in question) are similar and the mean of
the site in question differs from the other four, there are two possible
explanations: (1) Site X may be truly different from the other four
sites, or (2) the observer at Site X may not be recording accurately.
In any case, the data resulting from code "7" at Site X would be
interpreted with caution. This procedure allows for each observer's data to
be reviewed in order to estimate the accuracy of the individual on each
code and to allow for the data to be interpreted accordingly.
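
The site comparison described above can be sketched as follows. The text gives no numeric rule for deciding when a site "differs," so this Python sketch only reports the difference between the site in question and the mean of the remaining sites and leaves the judgment to the analyst. The function name `site_vs_others` and the site means are hypothetical.

```python
from statistics import mean


def site_vs_others(site_means: dict, site: str):
    """Return (mean at `site`, mean of the remaining sites, difference)."""
    others = [m for name, m in site_means.items() if name != site]
    if not others:
        raise ValueError("need at least one comparison site")
    other_mean = mean(others)
    return site_means[site], other_mean, site_means[site] - other_mean


# Hypothetical first-grade site means for one code (not study data).
means_code_7 = {"X": 2.0, "A": 6.0, "B": 5.0, "C": 7.0, "D": 6.0}
own, rest, diff = site_vs_others(means_code_7, "X")
print(own, rest, diff)
```

A large difference does not by itself say whether the site or the observer is the cause; it only marks the code at that site for cautious interpretation alongside the observer's accuracy rates.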

As an example, Table 2 shows the observer accuracy rate (the top
number) and the criterion accuracy rate (the bottom number) for each of
the Far West observers for each code. In addition, an overall accuracy
rate* for each observer on all "What" and "How" codes has been computed and
displayed on this table to provide a general idea of the observer's skill.
The results are grouped by grade level and site. Similar tables for the
other six sponsors in the evaluation were prepared. The complete
confusability matrix of all observers is not included in this report but is
available at SRI.

As previously discussed on page 3, five or fewer criterion examples
of a code minimize the confidence with which the actual results can be
utilized. Therefore, only codes with six or more examples are considered
in the analysis of specific grade levels within a site.

*The overall accuracy rate is arrived at by computing the ratio of
correct recordings (those that fall in the diagonal cells) of all codes
to the total number of recorded codes and to the total number of
criterion instances of the codes. For the "What" codes, the two ratios are
the same since the total number of recorded codes is equal to the total
number of criterion instances. Two ratios are required for the "How"
codes since observers are not required to record a "How" code in each
frame, which leads to differences between the total numbers of criterion
examples and total numbers of recorded codes.
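
The two overall ratios can be sketched in Python as below. This is an illustration under stated assumptions: the function name `overall_accuracy`, the tallies, and the totals are hypothetical. Cells are keyed by (recorded code, criterion code); intrusions add to the recorded total and omissions add to the criterion total, which is why the two denominators differ for "How" codes.

```python
def overall_accuracy(cells, n_recorded, n_criterion):
    """Return (correct / recorded codes, correct / criterion instances)."""
    correct = sum(n for (rec, crit), n in cells.items() if rec == crit)
    return correct / n_recorded, correct / n_criterion


# Hypothetical "How" tallies (not study data).
how_cells = {("A", "A"): 6, ("A", "B"): 1, ("B", "B"): 2}

# Assumed totals: 9 tallied frames plus 1 intrusion gives 10 recorded codes;
# 9 tallied frames plus 3 omissions gives 12 criterion instances.
by_recorded, by_criterion = overall_accuracy(how_cells, n_recorded=10, n_criterion=12)
print(by_recorded, round(by_criterion, 2))
```

For "What" codes the two totals are equal, so the same call returns the same value twice.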

As an illustration of how Table 2 can be used, the results of the
first observers listed are discussed. The observers are grouped according
to the site or project they observed. Going from left to right, the
"What" codes are first shown on the extreme left with the codes which are
represented by six or more criterion instances. The next section includes
the "What" codes that were represented by fewer than six instances. The
"How" codes are shown next, with a similar division.
1. Findings from "What" Codes Occurring Six or More Times

The first observer listed, Observer No. 1 from Site A, had an
overall reliability rate of .84 on the "What" codes (see Table 2). Of the
nine codes with six or more criterion instances, only two codes registered
an "observer accuracy" or "criterion accuracy" rate of less than .75.
Looking at code "4," instruction, we see an observer accuracy rate of .62
and a criterion accuracy rate of .89. This means that when Observer 1
recorded the "4" code, it was correct 62 percent of the time. The observer
recorded 89 percent of the criterion examples of "4"; thus, she missed only 11
percent of the examples. However, 38 percent of the time when she
recorded "4's" she was incorrect. Therefore, variables using the "4"
code in the first grade at Site A should be interpreted with caution.

The other code which the observer's results show to be less than
adequate was the "5NV" code, nonverbal general action. The
accuracy rate of 1.00 shows that when she recorded a "5NV" it was always
a "5NV"; she did not confuse it. However, she failed to code 50 percent
of the videotape examples of "5NV."

The overall results for Observer No. 1 show that the observation
data she gathered can be analyzed with a great deal of confidence. Only
the "4" and "5NV" code results have to be analyzed with special caution.

Three of the other first grade observers for this sponsor
registered accuracy rates of over .70 on the "4" code. The first grade
observer at Site E has an accuracy rate of only .54. If the results of
the data collection show that Grade 1, Sites A and E, have means and
standard deviations for code "4" that differ widely from the other sites,
it may be explained by the observers' confusion in the use of the "4"
code.

A similar situation exists with the data for the four other
first grade observers on the "5NV" code. The underestimation of the code
by Observer No. 1 at Site A is not common to all first grade observers.
Therefore, this should be taken into consideration when the data are
analyzed.

2. Findings from "How" Codes Occurring Six or More Times

It can be seen on Table 2 that Observer 1 at Site A was 100
percent accurate when she recorded five of the more frequent "How" codes.
The only coding exception is "A," academic. Only 67 percent of the time
were her "A" recordings correct. Thirty-three percent of the time they
were not "A's." However, she recorded 90 percent of the "A's" actually
occurring on the videotape. The extra 33 percent that she recorded are
considered intrusions,* and they overestimate the occurrence of this
code. Observers at other sites had their own specific difficulties, and
their data will have to be analyzed in the same way that Observer 1's has
been analyzed.
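
The two rates discussed above can be illustrated with a minimal sketch. It assumes that the "accuracy rate" is the share of an observer's recordings of a code that match the criterion, and that the second rate is the share of criterion examples she recorded correctly. The function name and the tallies are hypothetical, chosen only to mirror the "A" code pattern (about 67 percent correct, 90 percent of occurrences caught); they are not data from the study.

```python
def accuracy_rates(correct, recorded, criterion):
    """Return (accuracy rate, criterion rate) for a single code.

    correct   -- recordings of the code that matched the criterion
    recorded  -- all recordings of the code made by the observer
    criterion -- all videotaped criterion examples of the code
    """
    accuracy = correct / recorded          # share of her recordings that were right
    criterion_rate = correct / criterion   # share of true examples she caught
    return accuracy, criterion_rate


# Hypothetical tallies, not taken from the report.
correct, recorded, criterion = 18, 27, 20
accuracy, criterion_rate = accuracy_rates(correct, recorded, criterion)
intrusions = recorded - correct    # code recorded when it was not called for
omissions = criterion - correct    # criterion examples that were missed
print(round(accuracy, 2), round(criterion_rate, 2), intrusions, omissions)
```

Keeping the two rates separate shows why a code can be overestimated (many intrusions) even when most of its true occurrences are caught.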

3. Summary

The usefulness of this method of measuring the accuracy of
individual observers lies in its capacity to:

• Differentiate codes according to relatively high or low
levels of confidence;

• Assess an individual's coding skill on a specific code and
examine observer bias;

• Compare an individual observer's scores with other observers'
scores at the sponsor's same grade level.

By thus identifying the various sources of error in the
observation measures, we can more accurately determine whether specific
problems lie in the code itself or with the individual observer and
interpret the data accordingly.

* See page 18 for an explanation of intrusion.

E. A STUDY COMPARING INTER-RATER ACCURACY AND VIDEOTAPE SIMULATION
ACCURACY

The preceding section has examined the confusability of the
observation codes and the ability of observers to code criterion videotapes.
Videotaped simulations of classroom events are, admittedly, different from
actual classroom events. In an effort to compare the accuracy of
observer ratings on the simulations and inter-rater accuracy in
classrooms, a small study was conducted in one location. This section compares
the results obtained from both studies of accuracy for two observers.

1. Paired Observers

The first method, the paired observation, is the most commonly
used method of assessing interaction analysis instruments. The procedure
followed is to have the two observers situated in the same classroom,
coding exactly the same situation simultaneously. The recorded codes are
then evaluated in terms of percent agreement between the two observers.
Since the speed of the two observers is not expected to be consistent, the
ratio of the number of codes recorded by the observer is compared to the
ratio of the number of codes recorded by the trainer.

It must be pointed out that this paired observation procedure
has some serious limitations. First, two extra people in the classroom
are more obtrusive than one. Second, it is almost impossible to assure
that the two observers are focusing on exactly the same action. Due to
limited space, the two observers may not have the same angle of
observation; thus, what they see and hear may be somewhat different, and yet each
observer could be collecting a correct and adequate sample of the behavior
which is occurring. A third problem is that even if the marginal
frequency counts of a code by two observers are numerically similar, we
cannot be certain that the two observers have recorded specific incidents
exactly the same. Similar ratios could occur by chance. Lastly, it
happens that during the classroom observations certain interactions or
codes do not occur, or occur at such a minimal rate, that reliability
cannot be computed. There is no way to be certain that all codes will be
assessed within a given time period.

In the study, data from the 16 five-minute observations were
examined, using three "Who" codes, twelve "What" codes, and thirteen "How"
codes. To assess the coding accuracy of the two observers, the
proportion of frames that contained a particular code was recorded for
each trainer and trainee. From the proportions, the following equation
was computed for each code (p is for the trainer and q is for a given
observer on a given code):*

$$\text{percent agreement} = 100 \times \frac{\min(p, q)}{\max(p, q)}$$

Tables 3 and 4 show the overall percentage reliability of the
codes separately in terms of their ratio of frequency. It must be noted
that accuracy for low-frequency variables is difficult to interpret
because if one observer records an event four times and the other only two
times and they observe an equal number of frames, the agreement is only
50 percent, even though the actual difference is only two occurrences.
Higher frequency variables can tolerate a difference of two occurrences
and still show a high percentage of agreement. The data for each observer
is presented separately in Tables 3 and 4. Since there are 16 paired
observations, it is possible to have as many as 1,216 frames of
interaction. Therefore, we have separated the data into three categories:
least frequent, moderately frequent, and most frequent. Table 5 is
included to further clarify the results of the paired observations. It
includes the frequency scores of the SRI trainer as well as the ratios of
occurrence and percent agreement scores for both observers over all codes.
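
The sensitivity of percent agreement to low frequencies can be sketched as follows. This is a minimal illustration, assuming that percent agreement is 100 times the smaller of the two proportions divided by the larger, which reproduces the four-versus-two example above, and that agreement is set to 100 when both proportions are zero. The function name and the frame count are hypothetical.

```python
def percent_agreement(p, q):
    """Percent agreement between trainer proportion p and observer proportion q."""
    if p == 0 and q == 0:
        return 100.0  # convention for a code that neither person recorded
    return 100.0 * min(p, q) / max(p, q)


frames = 70  # hypothetical number of frames coded by each person

# A rare code and a common code, each differing by two occurrences.
low = percent_agreement(4 / frames, 2 / frames)
high = percent_agreement(40 / frames, 38 / frames)
print(round(low), round(high))
```

The same two-occurrence difference costs the rare code half of its agreement score but leaves the common code nearly untouched, which is why the low-frequency entries call for caution.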

The results show that both of the observers were very reliable
on the "Who" codes. The "What" codes were also recorded very reliably,
with only two exceptions. Observer 1 recorded less than half as many "8"
codes (praise) as the criterion observer, and Observer 2 missed nearly
80 percent of the occurrences of code "6" (task related statement).
Significantly, however, both of these codes occurred with low frequency.

The results on the "How" codes were much lower. Observer 1 was 
quite reliable on the "nv" (nonverbal), "g" (guide to alternative), "A" 
(academic), and "b" (behavior) codes. She was below the 50 percent 



* When p = 0 and q = 0, the percent agreement is assigned a value of 100.

Table 3

PERCENT AGREEMENT BETWEEN TRAINER/OBSERVER 1

WHO CODES

Percent Agreement: 91-100, 81-90, 71-80, 61-70, 51-60, 41-50
Least Frequent (0-60): Machine
Moderately Frequent (61-175): (none listed)
Most Frequent (176-1,216): Adult, Child
Total No. of Codes: [illeg.]

WHAT CODES

Percent Agreement: 91-100, 81-90, 71-80, 61-70, 51-60, 41-50
Least Frequent (0-60): 10; 11; 7; 8
Moderately Frequent (61-175): 1Q, 6; 5, 9
Most Frequent (176-1,216): [illeg.]; 12
Total No. of Codes: [illeg.], 5, 1, 2, 1, 1; TOTAL 13

HOW CODES

Percent Agreement: 91-100, 81-90, 71-80, 61-70, 51-60, 41-50, 31-40,
21-30, 11-20, 0-10
Least Frequent (0-60): U, G, DP; X; T; Q; N, O, W
Moderately Frequent (61-175): B; H
Most Frequent (176-1,216): A; NV
Total No. of Codes: 1, 1, 1, 1, 2, 3; TOTAL 13

Table 4

PERCENT AGREEMENT BETWEEN TRAINER/OBSERVER 2

WHO CODES

Percent Agreement: 91-100, 81-90, 71-80, 61-70, 51-60, 41-50
Least Frequent (0-60): (none listed)
Moderately Frequent (61-175): (none listed)
Most Frequent (176-1,216): Adult, Child
Total No. of Codes: (none listed)

WHAT CODES

Percent Agreement: 91-100, 81-90, 71-80, 61-70, 51-60, 41-50, 31-40,
21-30, 11-20
Least Frequent (0-60): (none listed)
Moderately Frequent (61-175): 10; 4, 7; 9
Most Frequent (176-1,216): 3; 1Q; 12; 1
Total No. of Codes: [illeg.], 1, 3, 2, 2, 1

HOW CODES

Percent Agreement: 91-100, 81-90, 71-80, 61-70, 51-60, 41-50, 31-40,
21-30, 11-20, 0-10
Least Frequent (0-60): DP; T; U, Q, G; O, W
Moderately Frequent (61-175): B
Most Frequent (176-1,216): A; NV
Total No. of Codes: 2, 1, 4, 1, 3

Table 5



PAIRED OBSERVATION RESULTS

           |---------------- Observer 2 ---------------|---------------- Observer 1 ---------------|
Code       Percent    Observer   Trainer    Trainer    Percent    Observer   Trainer    Trainer
           Agreement  Ratio      Ratio      Score      Agreement  Ratio      Ratio      Score

Adult      94         [illeg.]   [illeg.]   [illeg.]   [illeg.]   .482       [illeg.]   [illeg.]
Child      98         [illeg.]   [illeg.]   [illeg.]   [illeg.]   .488       [illeg.]   544
Machine    [illeg.]   [illeg.]   [illeg.]   [illeg.]   [illeg.]   .023       [illeg.]   [illeg.]

TOTAL FREQUENCY: [illeg.], [illeg.] (Observer 2); 1,160, [illeg.] (Observer 1)

1          60         .174       .104       115        [illeg.]   .171       .150       161
1Q         73         [illeg.]   [illeg.]   [illeg.]   [illeg.]   .103       [illeg.]   [illeg.]
2          [illeg.]   .001       .000       0*         *          .000       .001       1*
3          89         .288       .257       283        88         .212       .242       260
4          77         .091       .070       77         99         .150       .151       162
5          92         .091       .099       109        65         .086       .056       60
6          17*        .006       .035       38*        81         .069       .056       60
7          75*        .039       .052       57*        56*        .020       .036       38*
8          100*       .022       .022       24*        44*        .007       .016       17*
9          55         .038       .069       76         65         .042       .065       69
10         65*        .031       .020       22*        93*        .013       .014       15*
11         --*        .020       .000       0*         75*        .003       .004       4*
12         70         .087       .124       137        83         .121       .100       107
NV         71         .144       .203       224        73         .239       .174       187
X          51*        .021       .041       45*        43         .053       .023       23*
H          22*        .004       .018       20*        11         .006       .057       61
U          25*        .016       .004       4*         100*       .000       .000       0*
N          0*         .000       .009       10*        8*         .001       .012       13*
T          33*        .003       .009       9*         33*        [illeg.]   .009       9*
Q          23*        .005       .022       24*        11*        .001       .009       9*
G          29*        .008       .028       31*        100*       .033       .033       35*
P          --*        .002       .000       0*         *          .000       .001       1*
O          --*        .000       .001       1*         0*         .000       .016       17*
W          --*        .000       .002       2*         0*         .000       .007       7*
DP         100*       .000       .000       0*         100*       .000       .000       0*
A          93         .670       .622       687        94         .760       .713       767
B          16*        .008       .051       56*        62*        .018       .029       31*

* Fewer than 60 criterion instances.

Ratio = occurrence of a specific code/total number of frames recorded

agreement rate for the "X" (movement), "h" (happy), "n" (negative), and
"o" (object) codes. The remaining codes occurred less than ten times and,
therefore, no accuracy rate could be arrived at.

Observer 2's rate of accuracy was similar on the "How" codes.
She was reliable on the "nv," "DP," and "a" codes and below the 50 percent
level on the "Q," "G," and "b" codes. Eight of the "How" codes occurred
only ten or fewer times; thus generalizations regarding these codes should
be made with caution.

2. Videotaped Skits (Simulations)

The second phase of the reliability study was based on
videotaped skits. The procedure followed is to have the observers code
interactions seen on a videotape and compare that record with
predetermined criteria. The tape has stops or pauses between each interaction to
ensure that each observer knows which interaction to code.

The results are then compiled for each observer, and they reveal
both (1) which occurrences were not recorded and (2) which code was erroneously
recorded in its place. The procedure allows us to identify the problem
codes for each specific observer.

In the figures that follow, two values are shown in each cell.
For those cells that fall on the main diagonal, the upper value shows the
percent of the total number of recordings of the code that were correct. The
lower value shows the percent of times the code actually occurred and was
recorded correctly by the observer.

For cells that do not fall on the diagonal, the two values
indicate proportions of error rather than of accuracy. The upper value
shows the percent of times a specific code (as shown by row indicator)
was recorded instead of a specific criterion code (indicated by the
column) to the total number of recordings of that code. The lower value
indicates the percent of times that the specific code (indicated by the
row) was recorded when a given criterion code was called for (shown by the
column).

Figures 10, 11, 12, and 13 are matrices showing the percent of
accuracy and the percent of the total codes recorded for the two observers.
Computations are for the "What" and the "How" codes. The total number of
criterion instances of each code is shown at the bottom of each column.
The total number that the observer recorded is given at the end of each
row.
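
The cell values described above can be sketched as follows. This is a minimal illustration, assuming each coded frame yields one (recorded code, criterion code) pair, so that the upper value of a cell is its tally divided by the row total and the lower value is its tally divided by the column total. The function name and the tallies are hypothetical, and frames on which the observer recorded nothing are left out for simplicity.

```python
from collections import Counter


def confusability(pairs):
    """Map (recorded, criterion) -> (upper, lower) cell values.

    upper -- share of all recordings of the row code that fall in the cell
    lower -- share of all criterion instances of the column code in the cell
    """
    cells = Counter(pairs)
    row_totals = Counter(recorded for recorded, _ in pairs)     # end of each row
    col_totals = Counter(criterion for _, criterion in pairs)   # bottom of each column
    return {
        (r, c): (n / row_totals[r], n / col_totals[c])
        for (r, c), n in cells.items()
    }


# Hypothetical tallies: ten criterion "6" frames and ten criterion "5" frames.
pairs = [("6", "6")] * 4 + [("5", "6")] * 6 + [("5", "5")] * 9 + [("6", "5")] * 1
matrix = confusability(pairs)

# Overall criterion accuracy: correct entries over all criterion examples.
overall = sum(1 for r, c in pairs if r == c) / len(pairs)
print(matrix[("6", "6")], matrix[("5", "6")], overall)
```

In this made-up case the diagonal cell for "6" reads .80 over .40: most of the observer's "6" recordings were correct, yet most criterion "6" frames were coded as something else.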

Those codes that occurred fewer than seven times are listed in
the matrices but will not be discussed in the body of this text. A
decision was made that, in these cases, the confidence level with which
we might make predictions as to the reliability of an observer would be
so low as to render it unacceptable. Therefore, only the codes which were
tested by seven or more criterion examples will be considered in this
analysis.

As shown in Figure 10, the "What" matrix for Observer 1 indicates
that, of the eight codes that included seven or more criterion instances,
only the "6" code (task-related statement) and the "5NV" code had a
criterion accuracy lower than .70. In the "6" code, 80 percent of the recorded
"6" codes were correct, but 69 percent of the criterion codes were missed.
Moving up the "6" column we can see that 54 percent of the criterion "6"
codes were incorrectly coded as "s." The problem with the "5NV" code is
somewhat different. In this case the problem is that both the criterion
rate and the observer correctness were low. It appears that on the
simulations Observer 1 had difficulty distinguishing the "5NV" code from
the "4NV" (self learning or instruction) code, since she often coded the
criterion "4NV" instances as "5NV" and also the "5NV" criterion as "4NV."

Over all "What" codes, Observer 1 is reasonably accurate, with a
criterion rate of .76, which is average for all 63 observers examined by
the videotapes on the "What" codes.

Observer 2 also had a reasonable overall criterion accuracy rate
(.70), but she had coding problems with several codes (see Figure 11).
She did not record the "4" code (instruction) 66 percent of the time.
The "4NV," "6," "7," and "9" codes were also coded less frequently than
required. She used the "12" code (observing) eight more times than
required. They were confused with codes "3," "4," and
The "How" code accuracy for Observer 1 was also acceptable
(see Figure 12). Her overall criterion accuracy rate was .78. This
figure indicates that of the 111 criterion "How" codes presented, she
recorded them correctly 87 times (see lower right hand corner of Figure 12).

This figure is computed by dividing the total number of correct entries
by the exact number of videotaped criterion examples.

The only "How" code that Observer 1 recorded with less than 70 percent
accuracy was the "o" code (objects); 50 percent were missed and 64 percent
were recorded when not indicated. The other codes that fell below a .70
rate of accuracy were codes that included fewer than seven criterion
instances.

Observer 2 had a more difficult time recording the "How" codes
from the tapes. In Figure 13, her overall accuracy rate is shown as only
.56. On individual codes, the "o" was very reliable (1.00/.86), but the
"nv" was not used 41 percent of the time required. The "A" (academic)
was coded when not called for sixteen times as well as omitted eight times
when it should have been coded. "b" (behavior) and "DP" (dramatic play)
were ignored completely.

3. Conclusions

Two distinct procedures, the videotaped skits and the paired
observations, were used to assess the accuracy of two observers. The
results indicate average reliability for both observers on the "What"
code category. For the "How" category, Observer 1 is above average, but
Observer 2 is below the average of the other 62 observers.

Specifically, Observer 1 was acceptably accurate on the more
frequently used individual codes. Many of her codes, such as the "1Q,"
"3," "4," "7," "9," "12," "NV," "A," and "B," were shown to be very reliable
on both procedures. Only the "o" code (use of objects) was shown to be
unreliable on both procedures.

The results were equally good for Observer 2 on the "What" codes,
with only the "6" code (task related comment) being shown unreliable on
both procedures. The "How" code "b" (behavior) was also recorded poorly
in both procedures. In the case of the videotape codings she missed the
13 examples of the "b" code and underestimated it in the paired observations.
On the "a" code (academic), Observer 2 was 93 percent accurate on the
paired observations but had a .63/.77 reliability on the videotapes.
Other "How" codes such as "movement" and "object" are acceptably accurate
on the videotapes, while "nonverbal" is acceptably accurate on the paired
observations. "Guide" and "question," which were underestimated in the
inter-rater analysis, have too few examples on the videotape to be
discussed in terms of reliability.

While simulated videotaped events are limited in their scope
and differ from the classroom situation, they do offer a standard stimulus
to examine each observer's ability to code specified events and to identify
observer bias. There is still some confounding in the source of "system
error"; however, the variation introduced by a second observer is
eliminated. While the two systems of examining observer accuracy do yield
some different information, it is not contradictory, and the videotape
system is by far easier to interpret.